Speeding up VP9 Intra Encoder with Hierarchical Deep Learning Based Partition Prediction
In VP9 video codec, the sizes of blocks are decided during encoding by
recursively partitioning 64×64 superblocks using rate-distortion
optimization (RDO). This process is computationally intensive because of the
combinatorial search space of possible partitions of a superblock. Here, we
propose a deep learning based alternative framework to predict the intra-mode
superblock partitions in the form of a four-level partition tree, using a
hierarchical fully convolutional network (H-FCN). We created a large database
of VP9 superblocks and the corresponding partitions to train an H-FCN model,
which was subsequently integrated with the VP9 encoder to reduce the intra-mode
encoding time. The experimental results establish that our approach speeds up
intra-mode encoding by 69.7% on average, at the expense of a 1.71% increase in
the Bjontegaard delta bitrate (BD-rate). While VP9 provides several built-in
speed levels which are designed to provide faster encoding at the expense of
decreased rate-distortion performance, we find that our model is able to
outperform the fastest recommended speed level of the reference VP9 encoder for
the good quality intra encoding configuration, in terms of both speedup and
BD-rate.
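The recursive RDO search described above can be sketched as follows. This is a toy illustration, not the VP9 implementation: the cost function is a hypothetical SSE-plus-overhead proxy standing in for the real intra rate-distortion cost, and only square quad-splits are modeled.

```python
import numpy as np

OVERHEAD = 100.0  # hypothetical per-block signaling cost (illustrative)

def block_cost(block):
    # Toy stand-in for an intra RD cost: SSE around the block mean,
    # plus a fixed overhead term. A real encoder evaluates actual
    # prediction residuals and coding rates here.
    return float(np.sum((block - block.mean()) ** 2)) + OVERHEAD

def partition(block, min_size=8):
    """Recursively choose between coding a square block whole or
    splitting it into four quadrants, mirroring the four-level
    64x64 -> 8x8 partition tree searched during VP9 intra encoding.

    Returns (cost, tree), where tree is 'leaf' or a list of four subtrees.
    """
    no_split = block_cost(block)
    if block.shape[0] <= min_size:
        return no_split, 'leaf'
    h = block.shape[0] // 2
    quads = [block[:h, :h], block[:h, h:], block[h:, :h], block[h:, h:]]
    results = [partition(q, min_size) for q in quads]
    split_cost = sum(c for c, _ in results)
    if split_cost < no_split:
        return split_cost, [t for _, t in results]
    return no_split, 'leaf'

# A superblock with four flat but distinct quadrants splits once, then stops.
sb = np.zeros((64, 64))
sb[:32, 32:] = 100.0
sb[32:, :32] = 200.0
sb[32:, 32:] = 300.0
cost, tree = partition(sb)  # tree == ['leaf', 'leaf', 'leaf', 'leaf']
```

Even in this simplified form, the search visits every node of the partition tree, which is the per-superblock workload the H-FCN prediction replaces with a single inference.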
Deep learning solutions for video encoding and streaming
Video data has emerged as the top contributor to the global internet traffic, and video compression is the key technology that enables its efficient storage, transmission and retrieval. As the video compression technology advances to keep pace with the proliferation of video data, state of the art video codecs that rely on block based hybrid coding tend to become increasingly complex and computationally intensive. Moreover, currently, it appears challenging to significantly improve video compression efficiency by solely relying on traditional approaches. Consequently, deep learning techniques are being extensively explored in the context of designing video compression technologies. My research addresses the problem of making the benefits of data driven deep learning accessible to some key areas of video coding and compression based video streaming technologies.
First, this dissertation introduces a deep learning framework to speed up intra mode encoding in the VP9 video codec. In VP9, the sizes of blocks are decided by a computationally intensive rate-distortion optimization (RDO) process that evaluates the combinatorially complex search space of possible partitions of 64 × 64 superblocks. We devised a learning based alternative framework to predict the intra-mode superblock partitions using a hierarchical fully convolutional network (H-FCN), which was experimentally shown to speed up the intra-mode encoding of the reference VP9 encoder. Subsequently, our work on deep learning based block motion estimation is expounded. Block based motion estimation is essential for performing inter-prediction in hybrid codecs, the mechanism responsible for the bulk of the compression they achieve. However, prevalent block matching procedures that are used to compute block motion vectors (MVs) are computationally intensive, are prone to detecting spurious motions that worsen at smaller block sizes, and are agnostic to the perceptual quality of the predicted frames. To address these issues, we developed a composite block translation network (CBT-Net) that jointly predicts the MVs of blocks having multiple sizes by using the MVs predicted for larger blocks to guide the motion estimation of smaller blocks. Our framework produces more coherent motion fields at smaller block sizes than traditional block matching based MV estimation, and is also computationally efficient. Its rate-distortion performance gains are demonstrated for AV1 encoding.
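The coarse-to-fine idea behind CBT-Net, where a larger block's motion guides its sub-blocks, can be illustrated with classical block matching rather than a network. All block sizes, search ranges, and the synthetic frames below are purely illustrative assumptions:

```python
import numpy as np

def sad(a, b):
    # Sum of absolute differences, the usual block-matching cost.
    return float(np.abs(a - b).sum())

def search_mv(cur, ref, y, x, size, center=(0, 0), radius=4):
    """Full search for the (dy, dx) within `radius` of `center` that
    minimizes the SAD between the current block and the reference."""
    best, best_mv = float('inf'), center
    for dy in range(center[0] - radius, center[0] + radius + 1):
        for dx in range(center[1] - radius, center[1] + radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy and 0 <= xx and \
               yy + size <= ref.shape[0] and xx + size <= ref.shape[1]:
                cost = sad(cur[y:y+size, x:x+size],
                           ref[yy:yy+size, xx:xx+size])
                if cost < best:
                    best, best_mv = cost, (dy, dx)
    return best_mv

rng = np.random.default_rng(1)
ref = rng.integers(0, 255, (48, 48)).astype(np.float64)
cur = np.roll(ref, shift=(3, -2), axis=(0, 1))  # synthetic (3, -2) shift

# Coarse MV for the 16x16 block at (16, 16)...
mv16 = search_mv(cur, ref, 16, 16, 16, radius=4)
# ...seeds a cheap +/-1 refinement for each of its four 8x8 sub-blocks,
# instead of an independent full search per sub-block.
mv8 = [search_mv(cur, ref, 16 + oy, 16 + ox, 8, center=mv16, radius=1)
       for oy in (0, 8) for ox in (0, 8)]
```

Seeding the small-block search with the parent MV both shrinks the search and discourages the spurious, incoherent small-block motions that independent matching can produce; CBT-Net realizes the same hierarchy with learned joint prediction.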
The last part of this dissertation focuses on learning based approaches to designing compression based adaptive video streaming. Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step causes significant overhead in terms of both computation and time expenditure. To reduce this overhead, we employed a recurrent convolutional network (RCN) to implicitly analyze the spatiotemporal complexity of video shots in order to predict their convex hulls. The proposed RCN-Hull model substantially reduced the pre-encoding time needed for convex hull generation while closely approximating the optimal convex hulls. The competitive advantage of our method over existing ones based on heuristics or feature based machine learning was also demonstrated. The different deep learning frameworks introduced in this dissertation thus attest to the compelling advantages offered by deep learning based tools and techniques in driving the development and deployment of future video coding and streaming technologies.
Efficient Per-Shot Convex Hull Prediction By Recurrent Learning
Adaptive video streaming relies on the construction of efficient bitrate
ladders to deliver the best possible visual quality to viewers under bandwidth
constraints. The traditional method of content dependent bitrate ladder
selection requires a video shot to be pre-encoded with multiple encoding
parameters to find the optimal operating points given by the convex hull of the
resulting rate-quality curves. However, this pre-encoding step is equivalent to
an exhaustive search process over the space of possible encoding parameters,
which causes significant overhead in terms of both computation and time
expenditure. To reduce this overhead, we propose a deep learning based method
of content aware convex hull prediction. We employ a recurrent convolutional
network (RCN) to implicitly analyze the spatiotemporal complexity of video
shots in order to predict their convex hulls. A two-step transfer learning
scheme is adopted to train our proposed RCN-Hull model, which ensures
sufficient content diversity to analyze scene complexity, while also making it
possible to capture the scene statistics of pristine source videos. Our
experimental results reveal that our proposed model yields better
approximations of the optimal convex hulls, and offers competitive time savings
as compared to existing approaches. On average, the pre-encoding time was
reduced by 58.0% by our method, while the average Bjontegaard delta bitrate
(BD-rate) of the predicted convex hulls against the ground truth was 0.08%,
and the mean absolute deviation of the BD-rate distribution was 0.44%.
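The exhaustive step that RCN-Hull avoids can be made concrete with a small sketch: collect the (bitrate, quality) point of every pre-encode and keep only the upper convex hull, which yields the Pareto-optimal operating points of the bitrate ladder. The encode points below are synthetic, hypothetical numbers for illustration only:

```python
def cross(o, a, b):
    # 2D cross product of vectors o->a and o->b; sign gives turn direction.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_convex_hull(points):
    """Upper convex hull of (rate, quality) points via a monotone chain,
    returned in order of increasing rate."""
    hull = []
    for p in sorted(set(points)):
        # Pop the last hull point while it makes a non-right turn,
        # i.e. it lies on or below the chord to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

# Synthetic (kbps, quality-score) pre-encodes from three hypothetical
# resolutions of one shot.
encodes = [(300, 62), (600, 74), (900, 78),      # e.g. 540p
           (700, 70), (1400, 84), (2100, 88),    # e.g. 720p
           (1600, 80), (3200, 92), (4800, 95)]   # e.g. 1080p
ladder = upper_convex_hull(encodes)
# Dominated points such as (700, 70) and (1600, 80) are discarded.
```

Producing the nine `encodes` points requires nine full encodes per shot; predicting the hull directly from the source video is what removes most of that pre-encoding cost.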